Abstract

The statistical analysis of massive and complex data sets will require
the development of algorithms that depend on distributed computing
and collaborative inference. Inspired by this, we propose a collaborative
framework that aims to estimate the unknown mean $\theta$ of a
random variable $X$. In the model we present, a certain number of
calculation units, distributed across a communication network represented
by a graph, participate in the estimation of $\theta$ by sequentially
receiving independent data from $X$ while exchanging messages via a
stochastic matrix $A$ defined over the graph. We give precise conditions
on the matrix $A$ under which the statistical precision of the individual
units is comparable to that of a (gold standard) virtual centralized
estimate, even though each unit does not have access to all of the data.
We show in particular the fundamental role played by both the nontrivial
eigenvalues of $A$ and the Ramanujan class of expander graphs,
which provide remarkable performance for moderate algorithmic cost.


Index Terms: Distributed computing, collaborative estimation, stochastic
matrix, graph theory, complexity, Ramanujan graph.

1 Introduction


A promising way to overcome computational problems associated with inference
and prediction in large-scale settings is to take advantage of distributed
and collaborative algorithms, whereby several processors perform computations
and exchange messages with the end-goal of minimizing a certain cost
function. For instance, in modern data analysis one is frequently faced with
problems where the sample size is too large for a single computer or standard
computing resources. Distributed processing of such large data sets is often
regarded as a possible solution to data overload, although designing and
analyzing algorithms in this setting is challenging. Indeed, good distributed
and collaborative architectures should maintain the desired statistical
accuracy of their centralized counterpart, while retaining sufficient flexibility
and avoiding communication bottlenecks which may excessively slow down
computations. The literature is too vast to permit anything like a fair
summary within the confines of a short introduction; the papers by Duchi et al.
(2012), Jordan (2013), Zhang et al. (2013), and references therein contain a
sample of relevant work.

Similarly, the advent of sensor, wireless and peer-to-peer networks in science
and technology necessitates the design of distributed and information-exchange
algorithms (Boyd et al., 2006; Predd et al., 2009). Such networks
are designed to perform inference and prediction tasks for the environments
they are sensing. Nonetheless, they are typically characterized by constraints
on energy, bandwidth and/or privacy, which limit the sensors’ ability to
share data with each other or with a hub for centralized processing. For
example, in a hospital network, the aim is to make safer decisions by sharing
information between therapeutic services. However, a simple exchange
of database entries containing patient details can pose information privacy
risks. At the same time, a large percentage of medical data may require
exchanging high-resolution images, the centralized processing of which may
be computationally prohibitive. Overall, such constraints call for the design
of communication-constrained distributed procedures, where each node exchanges
information with only a few of its neighbors at each time instance.
The goal in this setting is to distribute the learning task in a computationally
efficient way, and make sure that the statistical performance of the network
matches that of the centralized version.

The foregoing observations have motivated the development and analysis
of many local message-passing algorithms for distributed and collaborative
inference, optimization and learning. Roughly speaking, message-passing
procedures are those that use only local communication to approximately
achieve the same end as global (i.e., centralized) algorithms, which require
sending raw data to a central processing facility. Message-passing algorithms
are thought to be efficient by virtue of their exploitation of local
communication. They have been successfully involved in kernel linear least-squares
regression estimation (Predd et al., 2009), support vector machines (Forero
et al., 2010), sparse $L_1$ regression (Mateos et al., 2010), gradient-type
optimization (Tsitsiklis et al., 1986; Bertsekas and Tsitsiklis, 1997), and various
online inference and learning tasks (Bianchi et al., 2011a,b, 2013). An
important research effort has also been devoted to so-called averaging and
consensus problems, where a set of autonomous agents (which may be sensors
or nodes of a computer network) compute the average of their opinions in
the presence of restricted communication capabilities and try to agree on a
collective decision (e.g., Blondel et al., 2005; Olshevsky and Tsitsiklis, 2011).

However, despite their rising success and impact in machine learning, little
is known regarding the statistical properties of message-passing algorithms.
The statistical performance of collaborative computing has so far been studied
in terms of consensus (i.e., whether all nodes give the same result), with
perhaps mean convergence rates (e.g., Olshevsky and Tsitsiklis, 2011; Duchi
et al., 2012; Zhang et al., 2013). While it is therefore proved that using a
network, even a sparse one (i.e., with few connections), does not degrade the rate
of convergence, the problem of whether it is optimal to do this remains
unanswered, including for the most basic statistics. For example, which network
properties guarantee collaborative calculation performances equal to those
of a hypothetical centralized system? The goal of this article is to give a
more precise answer to this fundamental question. In order to present in the
clearest way possible the properties such a network must have, we undertake
this study for the simplest possible statistic: the mean.

In the model we consider, there are a number of computing agents (also
known as nodes or processors) that sequentially estimate the mean of a
random variable by regularly updating an estimate stored in their memory.
Meanwhile, they exchange messages, thus informing each other about the
results of their latest computations. Agents that receive messages use them
to directly update the value in their memory by forming a convex combination.
We focus primarily on the properties that the communication process
must satisfy to ensure that the statistical precision of a single processor (which
only sees part of the data) is similar to that of an inaccessible centralized
intelligence that could tackle the whole data set at once. The literature is
surprisingly quiet on this question, which we believe is of fundamental
importance if we want to provide concrete tradeoffs between communication
constraints and statistical accuracy.

This paper makes several important contributions. First, in Section 2 we
introduce communication network models and define a performance ratio
allowing us to quantify the statistical quality of a network. In Section 3 we
analyze the asymptotic behavior of this performance ratio as the number
of data items $t$ received online sequentially per node becomes large, and
give precise conditions on communication matrices $A$ so that this ratio is
asymptotically optimal. Section 4 goes one step further, connecting the rate
of convergence of the ratio with the behavior of the eigenvalues of $A$. In
Section 5 we present the remarkable Ramanujan expander graphs and analyze
the tradeoff between statistical efficiency and communication complexity for
these graphs with a series of simulation studies. Lastly, Section 6 provides
several elements for analysis of more complicated asynchronous models with
delays. For clarity, proofs are gathered in Section 7.


2 The model 


Let $X$ be a square-integrable real-valued random variable, with $\mathbb{E}X = \theta$ and
$\mathrm{Var}(X) = \sigma^2$. We consider a set $\{1, \ldots, N\}$ of computing entities ($N \geq 2$)
that collectively participate in the estimation of $\theta$. In this distributed model,
agent $i$ sequentially receives an i.i.d. sequence $X_1^{(i)}, \ldots, X_t^{(i)}, \ldots$, distributed
as the prototype $X$, and forms, at each time $t$, an estimate of $\theta$. It is assumed
throughout that the $X_t^{(i)}$ are independent when both $t \geq 1$ and $i \in \{1, \ldots, N\}$
vary.

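In the absence of communication between agents, the natural estimate held
by agent $i$ at time $t$ is the empirical mean

\[ \overline{X}_t^{(i)} = \frac{1}{t} \sum_{k=1}^{t} X_k^{(i)}. \]

Equivalently, processor $i$ is initialized with $\overline{X}_1^{(i)} = X_1^{(i)}$ and performs its
estimation via the iteration

\[ \overline{X}_{t+1}^{(i)} = \frac{t\,\overline{X}_t^{(i)} + X_{t+1}^{(i)}}{t+1}, \quad t \geq 1. \]

Let $\top$ denote transposition and assume that vectors are in column format.
Letting $X_t = (X_t^{(1)}, \ldots, X_t^{(N)})^\top$ and $\overline{X}_t = (\overline{X}_t^{(1)}, \ldots, \overline{X}_t^{(N)})^\top$, we see that

\[ \overline{X}_{t+1} = \frac{t\,\overline{X}_t + X_{t+1}}{t+1}, \quad t \geq 1. \tag{2.1} \]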

In a more complicated collaborative setting, besides its own measurements
and computations, each agent may also receive messages from other processors
and combine this information with its own conclusions. At its core, this
message-passing process can be modeled by a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$
with vertex set $\mathcal{V} = \{1, \ldots, N\}$ and edge set $\mathcal{E}$. This graph represents
the way agents communicate, with an edge from $j$ to $i$ (in that order) if $j$
sends information to $i$. Furthermore, we have an $N \times N$ stochastic matrix
$A = (a_{ij})_{1 \leq i,j \leq N}$ (i.e., $a_{ij} \geq 0$ and, for each $i$, $\sum_{j=1}^{N} a_{ij} = 1$) with associated
graph $\mathcal{G}$, i.e., $a_{ij} > 0$ if and only if $(j,i) \in \mathcal{E}$. The matrix $A$ accounts
for the way agents incorporate information during the collaborative process.
Denoting by $\hat\theta_t = (\hat\theta_t^{(1)}, \ldots, \hat\theta_t^{(N)})^\top$ the collection of estimates held by the $N$
agents over time, the computation/combining mechanism is assumed to be
as follows:

\[ \hat\theta_{t+1} = \frac{t A \hat\theta_t + X_{t+1}}{t+1}, \quad t \geq 1, \]

with $\hat\theta_1 = (X_1^{(1)}, \ldots, X_1^{(N)})^\top$. Thus, each individual estimate $\hat\theta_{t+1}^{(i)}$ is a convex
combination of the estimates held by the agents over the network at time
$t$, augmented by the new observation $X_{t+1}^{(i)}$.

The matrix $A$ models the way processors exchange messages and collaborate,
ranging from $A = I_N$ (the $N \times N$ identity matrix, i.e., no communication)
to $A = \mathbf{1}\mathbf{1}^\top/N$ (where $\mathbf{1} = (1, \ldots, 1)^\top$, i.e., full communication). We note in
particular that the choice $A = I_N$ gives back iteration (2.1) with $\hat\theta_t = \overline{X}_t$.

We also note that, given a graph $\mathcal{G}$, various choices are possible for $A$. Thus,
aside from a convenient way to represent a communication channel over which
agents can retrieve information from each other, the matrix $A$ can be seen
as a “tuning parameter” on $\mathcal{G}$ to improve the statistical performance of $\hat\theta_t$,
as we shall see later. Important examples for $A$ include the choices


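\[ A_1 = \frac{1}{2} \begin{pmatrix} 1 & 1 & & & \\ 1 & 0 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 0 & 1 \\ & & & 1 & 1 \end{pmatrix} \tag{2.2} \]

and

\[ A_2 = \frac{1}{3} \begin{pmatrix} 2 & 1 & & & \\ 1 & 1 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 1 & 1 \\ & & & 1 & 2 \end{pmatrix} \tag{2.3} \]

(unmarked entries are zero). It is easy to verify that for all $t \geq 1$,

\[ \hat\theta_t = \frac{1}{t} \sum_{k=0}^{t-1} A^k X_{t-k}. \tag{2.4} \]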
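
As a numerical sanity check of identity (2.4), the following minimal sketch runs the update $\hat\theta_{t+1} = (tA\hat\theta_t + X_{t+1})/(t+1)$, started from $\hat\theta_1 = X_1$, and compares the result with the closed form $\frac{1}{t}\sum_{k=0}^{t-1} A^k X_{t-k}$. The function names, the choice $N = 5$, the uniform samples and the seed are illustrative assumptions, not part of the paper.

```python
import random

def matvec(A, x):
    """Multiply the square matrix A (a list of rows) by the vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def run_recursion(A, samples):
    """Iterate theta_{t+1} = (t*A*theta_t + X_{t+1}) / (t+1), with theta_1 = X_1."""
    theta = list(samples[0])
    for t in range(1, len(samples)):
        mixed = matvec(A, theta)                      # convex combination over neighbours
        theta = [(t * m + x) / (t + 1) for m, x in zip(mixed, samples[t])]
    return theta

def closed_form(A, samples):
    """Evaluate (1/t) * sum_{k=0}^{t-1} A^k X_{t-k} term by term."""
    t = len(samples)
    total = [0.0] * len(A)
    for k in range(t):
        v = list(samples[t - 1 - k])                  # X_{t-k} is stored at index t-k-1
        for _ in range(k):                            # apply A exactly k times
            v = matvec(A, v)
        total = [s + w for s, w in zip(total, v)]
    return [s / t for s in total]

# Illustrative setting: the tridiagonal matrix A_2 of (2.3) with N = 5 nodes.
N = 5
A2 = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        if abs(i - j) <= 1:
            A2[i][j] = 1 / 3
A2[0][0] = A2[N - 1][N - 1] = 2 / 3

rng = random.Random(0)                                # arbitrary seed
samples = [[rng.random() for _ in range(N)] for _ in range(30)]
theta_rec = run_recursion(A2, samples)
theta_sum = closed_form(A2, samples)
```

Up to floating-point rounding, `theta_rec` and `theta_sum` should coincide coordinate by coordinate.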

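Thus, denoting by $\|\cdot\|$ the Euclidean norm (for vectors or matrices), we may
write, for all $t \geq 1$,

\begin{align*}
\mathbb{E}\|\hat\theta_t - \theta\mathbf{1}\|^2 &= \frac{1}{t^2}\,\mathbb{E}\Big\|\sum_{k=0}^{t-1} A^k (X_{t-k} - \theta\mathbf{1})\Big\|^2 \quad \text{(since $A^k$ is a stochastic matrix)} \\
&= \frac{1}{t^2} \sum_{k=0}^{t-1} \mathbb{E}\|A^k (X_{t-k} - \theta\mathbf{1})\|^2,
\end{align*}

by independence of $X_1, \ldots, X_t$. It follows that

\begin{align*}
\mathbb{E}\|\hat\theta_t - \theta\mathbf{1}\|^2 &\leq \mathbb{E}\|X_1 - \theta\mathbf{1}\|^2 \times \frac{1}{t^2} \sum_{k=0}^{t-1} \|A^k\|^2 \\
&\leq \mathbb{E}\|X_1 - \theta\mathbf{1}\|^2 \times \frac{N}{t}.
\end{align*}

In the last inequality, we used the fact that $A^k$ is a stochastic matrix and thus
$\|A^k\|^2 \leq N$ for $k \geq 0$. We can merely conclude that $\mathbb{E}\|\hat\theta_t - \theta\mathbf{1}\|^2 \to 0$ as
$t \to \infty$ (mean-squared error consistency), and so $\hat\theta_t^{(i)} \to \theta$ in probability for
each $i \in \{1, \ldots, N\}$. Put differently, the agents asymptotically agree on the
(true) value of the parameter, independently of the choice of the (stochastic)
matrix $A$; this property is often called consensus in the distributed
optimization literature (see, e.g., Bertsekas and Tsitsiklis, 1997).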

The consensus property, although interesting, does not say anything about
the positive (or negative) impact of the graph on the comparative performances
of estimates with respect to a centralized version. To clarify this
remark, assume that there exists a centralized intelligence that could tackle
all data $X_1^{(1)}, \ldots, X_t^{(1)}, \ldots, X_1^{(N)}, \ldots, X_t^{(N)}$ at time $t$, and take advantage of
these samples to assess the value of the parameter $\theta$. In this ideal framework,
the natural estimate of $\theta$ is the global empirical mean

\[ \overline{X}_{Nt} = \frac{1}{Nt} \sum_{i=1}^{N} \sum_{k=1}^{t} X_k^{(i)}, \]

which is clearly the best we can hope for with the data at hand. However, this
estimate is to be considered as an unattainable “gold standard” (or oracle),
insofar as it uses the whole $(N \times t)$-sample. In other words, its evaluation
requires sending all examples to a centralized processing facility, which is
precisely what we want to avoid.

Thus, a natural question arises: can the message-passing process be tapped to
ensure that the individual estimates $\hat\theta_t^{(i)}$ achieve statistical accuracy “close”
to that of the gold standard $\overline{X}_{Nt}$? Figure 1 illustrates this pertinent question.


[Figure 1, two panels with horizontal axis $t$ from 0 to 400: “Message-passing ($A = A_2$)”
and “No message-passing ($A = I_5$)”.]

Figure 1: Convergence of individual nodes’ estimates with and without
message-passing.


In the trials shown, i.i.d. uniform random variables on $[0,1]$ are delivered
online to $N = 5$ nodes, one to each at each time $t$. With message-passing
(here, $A = A_2$), each node aggregates the new data point with data it has seen
previously and messages received from its nearest neighbors in the network.
We see that all of the five nodes’ updates seem to converge with a performance
comparable to that of the (unseen) global estimate $\overline{X}_{Nt}$ to the mean 0.5.
In contrast, in the absence of message-passing ($A = I_5$), individual nodes’
estimates do still converge to 0.5, but at a slower rate.

To deal with this question of statistical accuracy satisfactorily, we first need
a criterion to compare the performance of $\hat\theta_t$ with that of $\overline{X}_{Nt}$. Perhaps the
most natural one is the following ratio, which depends upon the matrix $A$:

\[ \tau_t(A) = \frac{\mathbb{E}\|(\overline{X}_{Nt} - \theta)\mathbf{1}\|^2}{\mathbb{E}\|\hat\theta_t - \theta\mathbf{1}\|^2}, \quad t \geq 1. \]

The closer this ratio is to 1, the more statistically efficient the collaborative
algorithm is, in the sense that its performance compares favorably
to that of the centralized gold standard. In the remainder of the paper, we
call $\tau_t(A)$ the performance ratio at time $t$.
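
The performance ratio can be evaluated numerically without simulation. Because the coordinates of $X_t$ are independent with a common variance, $\mathbb{E}\|A^k(X_{t-k} - \theta\mathbf{1})\|^2$ equals that variance times $\|A^k\|^2$, while the numerator equals the variance divided by $t$; the ratio therefore reduces to $\tau_t(A) = t / \sum_{k=0}^{t-1} \|A^k\|^2$. The sketch below implements this reduction; the helper names and the two test matrices are illustrative assumptions.

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def frobenius_sq(M):
    """Squared Euclidean (Frobenius) norm of a matrix."""
    return sum(v * v for row in M for v in row)

def performance_ratio(A, t):
    """tau_t(A) = t / sum_{k=0}^{t-1} ||A^k||^2 (variance terms cancel)."""
    n = len(A)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A^0
    total = 0.0
    for _ in range(t):
        total += frobenius_sq(power)
        power = matmul(power, A)
    return t / total

N = 5
identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
full = [[1.0 / N] * N for _ in range(N)]   # A = 11^T / N, full communication
```

With these two extreme choices, `performance_ratio(identity, t)` equals $1/N$ for every $t$, whereas `performance_ratio(full, t)` equals $t/(N + t - 1)$ and thus approaches 1 as $t$ grows.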

Of particular interest in our approach is the stochastic matrix $A$, which
plays a crucial role in the analysis. Roughly, a good choice for $A$ is one for
which $\tau_t(A)$ is not too far from 1, while ensuring that communication over
the network is not prohibitively expensive. Although there are several ways
to measure “complexity” of the message-passing process, we have in mind a
setting where the communication load is well-balanced between agents, in the
sense that no node should play a dominant role. To formalize this idea, we
define the communication-complexity index $\mathcal{C}(A)$ as the maximal indegree
of the graph $\mathcal{G}$ associated with $A$, i.e., the maximal number of
edges pointing to a node in $\mathcal{G}$ (by convention, self-loops are counted twice
when $\mathcal{G}$ is undirected). Essentially, $A$ is communication-efficient when $\mathcal{C}(A)$
is small with respect to $N$ or, more generally, when $\mathcal{C}(A) = O(1)$ as $N$
becomes large.

To provide some context, $\mathcal{C}(A)$ measures in a certain sense the “local” aspect
of message exchanges induced by $A$. We have in mind node connection
set-ups where $\mathcal{C}(A)$ is small, perhaps due to energy or bandwidth constraints
in the system’s architecture, or when for privacy reasons data must not be
sent to a central node. Indeed, a large $\mathcal{C}(A)$ roughly means that one or
several nodes play centralized roles, which is precisely what we are trying to avoid.
Furthermore, the decentralized networks we are interested in can be seen as
being more autonomous than high-$\mathcal{C}(A)$ ones, in the sense that having few
network connections means fewer things that can potentially break, as well
as improved robustness due to the fact that the loss of one node does not
lead to destruction of the whole system. As examples, the matrices $A_1$ and
$A_2$ defined earlier have $\mathcal{C}(A_1) = 3$ and $\mathcal{C}(A_2) = 4$, respectively, while the
stochastic matrix $A_3$ below has $\mathcal{C}(A_3) = N + 1$:

\[ A_3 = \frac{1}{N} \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & N-1 & & & \\ 1 & & N-1 & & \\ \vdots & & & \ddots & \\ 1 & & & & N-1 \end{pmatrix}. \tag{2.5} \]

Thus, from a network complexity point of view, $A_1$ and $A_2$ are preferable to
$A_3$, where node 1 has the flavor of a central command center.
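
The communication-complexity index is straightforward to compute from the support of $A$. The sketch below counts, for each node, the nonzero entries of its row (incoming edges, since $a_{ij} > 0$ means that $j$ sends to $i$) and counts a self-loop twice, following the convention stated for undirected graphs. The function names are hypothetical helpers introduced for illustration.

```python
def complexity_index(A):
    """Maximal indegree of the graph of A; a self-loop is counted twice."""
    n = len(A)
    best = 0
    for i in range(n):
        indegree = sum(1 for j in range(n) if A[i][j] > 0)  # edges j -> i
        if A[i][i] > 0:
            indegree += 1                                   # self-loop counted twice
        best = max(best, indegree)
    return best

def tridiagonal_A1(n):
    """Matrix A_1 of (2.2): zero diagonal except at the two corners."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        A[i][i + 1] = A[i + 1][i] = 0.5
    A[0][0] = A[n - 1][n - 1] = 0.5
    return A

def tridiagonal_A2(n):
    """Matrix A_2 of (2.3): every node also keeps a share of its own value."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1 / 3
        if i + 1 < n:
            A[i][i + 1] = A[i + 1][i] = 1 / 3
    A[0][0] = A[n - 1][n - 1] = 2 / 3
    return A

def star_A3(n):
    """Matrix A_3 of (2.5): node 1 exchanges with every other node."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[0][i] = A[i][0] = 1 / n
        if i > 0:
            A[i][i] = (n - 1) / n
    return A
```

For instance, with `n = 6` the three constructors give indices 3, 4 and 7, in agreement with $\mathcal{C}(A_1) = 3$, $\mathcal{C}(A_2) = 4$ and $\mathcal{C}(A_3) = N + 1$.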


Now, having defined $\tau_t(A)$ and $\mathcal{C}(A)$, it is natural to suspect that there will
be some kind of tradeoff between implementing a low-complexity message-passing
algorithm (i.e., $\mathcal{C}(A)$ small) and achieving good asymptotic performance
(i.e., $\tau_t(A) \approx 1$ for large $t$). Our main goal in the next few sections is to
probe this intuition by analyzing the asymptotic behavior of $\tau_t(A)$ as $t \to \infty$
under various assumptions on $A$. We start by proving that $\tau_t(A) \leq 1$ for all
$t \geq 1$, and give precise conditions on the matrix $A$ under which $\tau_t(A) \to 1$.
Thus, thanks to the benefit of inter-agent communication, the statistical
accuracy of individual estimates may be asymptotically comparable to that of
the gold standard, despite the fact that none of the agents in the network
have access to all of the data. Indeed, as we shall see, this stunning result is
possible even for low-$\mathcal{C}(A)$ matrices. The take-home message here is that the
communication process, once cleverly designed, may “boost” the individual
estimates, even in the presence of severe communication constraints. We also
provide an asymptotic development of $\tau_t(A)$, which offers valuable information
on the optimal way to design the communication network in terms of
the eigenvalues of $A$.


3 Convergence of the performance ratio 


Recall that a stochastic square matrix $A = (a_{ij})_{1 \leq i,j \leq N}$ is irreducible if for
every pair of indices $i$ and $j$, there exists a nonnegative integer $k$ such that
$(A^k)_{ij}$ is not equal to 0. The matrix is said to be reducible if it is not
irreducible.


Proposition 3.1. We have $\frac{1}{N} \leq \tau_t(A) \leq 1$ for all $t \geq 1$. In addition, if $A$
is reducible, then

\[ \tau_t(A) \leq 1 - \frac{1}{N+1}, \quad t \geq 1. \]

It is apparent from the proof of the proposition (all proofs are found in Section
7) that the lower bound $1/N$ for $\tau_t(A)$ is achieved by taking $A = I_N$, which
is clearly the worst choice in terms of communication. This proposition also
shows that the irreducibility of $A$ is a necessary condition for the collaborative
algorithm to be statistically efficient, for otherwise there exists $\varepsilon \in (0,1)$ such
that $\tau_t(A) \leq 1 - \varepsilon$ for all $t \geq 1$.

We recall from the theory of Markov chains (e.g., Grimmett and Stirzaker,
2001) that for a fixed agent $i \in \{1, \ldots, N\}$, the period of $i$ is the greatest
common divisor of all positive integers $k$ such that $(A^k)_{ii} > 0$. When $A$ is
irreducible, the period of every state is the same and is called the period
of $A$. The following lemma describes the asymptotic behavior of $\tau_t(A)$ as $t$
tends to infinity.

Lemma 3.1. Assume that $A$ is irreducible, and let $d$ be its period. Then
there exist projectors $Q_1, \ldots, Q_d$ such that

\[ \tau_t(A) \to \frac{1}{\sum_{\ell=1}^{d} \|Q_\ell\|^2} \quad \text{as } t \to \infty. \]

The projectors $Q_1, \ldots, Q_d$ in Lemma 3.1 originate from the decomposition

\[ A^k = \sum_{\ell=1}^{d} \lambda_\ell^k Q_\ell + \sum_{\gamma \in \Gamma} \gamma^k Q_\gamma(k), \]

where $\lambda_1 = 1, \ldots, \lambda_d$ are the (distinct) eigenvalues of $A$ of unit modulus, $\Gamma$ the
set of eigenvalues of $A$ of modulus strictly smaller than 1, and $Q_\gamma(k)$ certain
$N \times N$ matrices (see Theorem 7.1 in the proofs section). In particular, we
see that $\tau_t(A) \to 1$ as $t \to \infty$ if and only if $\sum_{\ell=1}^{d} \|Q_\ell\|^2 = 1$. It turns out that
this condition is satisfied if and only if $A$ is irreducible, aperiodic (i.e., $d = 1$),
and bistochastic, i.e., $\sum_{i=1}^{N} a_{ij} = \sum_{j=1}^{N} a_{ij} = 1$ for all $(i,j) \in \{1, \ldots, N\}^2$.

This important result is encapsulated in the next theorem.

Theorem 3.1. We have $\tau_t(A) \to 1$ as $t \to \infty$ if and only if $A$ is irreducible,
aperiodic, and bistochastic.

Theorem 3.1 offers necessary and sufficient conditions for the communication
matrix $A$ to be asymptotically statistically efficient. Put differently, under
the conditions of the theorem, the message-passing process conveys sufficient
information to local computations to make individual estimates as accurate
as the gold standard for large $t$. In the context of multi-agent coordination,
an example of such a communication network is the so-called (time-invariant)
equal neighbor model (Tsitsiklis et al., 1986; Olshevsky and Tsitsiklis, 2011),
in which

\[ a_{ij} = \begin{cases} 1/|N^{(i)}| & \text{if } j \in N^{(i)} \\ 0 & \text{otherwise,} \end{cases} \]

where

\[ N^{(i)} = \{ j \in \{1, \ldots, N\} : a_{ij} > 0 \} \]

is the set of agents whose value is taken into account by $i$, and $|N^{(i)}|$ its
cardinality. Clearly, the communication matrix $A$ is stochastic, and also
bistochastic as soon as $A$ is symmetric (bidirectional model). Assuming in
addition that the directed graph $\mathcal{G}$ associated with $A$ is strongly connected
means that $A$ is irreducible. Moreover, if $a_{ii} > 0$ for some $i \in \{1, \ldots, N\}$,
then $A$ is also aperiodic, so the conditions of Theorem 3.1 are fulfilled.
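
The three conditions of Theorem 3.1 can be checked mechanically for a given matrix. The sketch below is one possible way to do so, under stated assumptions: irreducibility is tested as strong connectivity of the support graph by two breadth-first searches, the period is obtained as the greatest common divisor of `level[u] + 1 - level[v]` over all edges `u -> v` of a breadth-first tree (a standard graph-theoretic computation, valid for irreducible matrices), and bistochasticity is tested on row and column sums up to a tolerance. All function names are hypothetical.

```python
from collections import deque
from math import gcd

def is_irreducible(A):
    """Strong connectivity of the support graph (forward and backward search)."""
    n = len(A)

    def reaches_all(adj):
        seen, queue = {0}, deque([0])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return len(seen) == n

    forward = [[j for j in range(n) if A[i][j] > 0] for i in range(n)]
    backward = [[j for j in range(n) if A[j][i] > 0] for i in range(n)]
    return reaches_all(forward) and reaches_all(backward)

def period(A):
    """Period of an irreducible matrix via breadth-first levels."""
    n = len(A)
    level, queue, g = {0: 0}, deque([0]), 0
    while queue:
        u = queue.popleft()
        for v in range(n):
            if A[u][v] > 0:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
                g = gcd(g, level[u] + 1 - level[v])
    return g

def is_bistochastic(A, tol=1e-12):
    """Row sums and column sums all equal to 1, up to tol."""
    n = len(A)
    rows = all(abs(sum(A[i]) - 1.0) <= tol for i in range(n))
    cols = all(abs(sum(A[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return rows and cols

def satisfies_theorem_3_1(A):
    """Irreducible, aperiodic and bistochastic."""
    return is_irreducible(A) and period(A) == 1 and is_bistochastic(A)

def tridiagonal_A1(n):
    """Matrix A_1 of (2.2), used here as an example."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        A[i][i + 1] = A[i + 1][i] = 0.5
    A[0][0] = A[n - 1][n - 1] = 0.5
    return A
```

As an example of a matrix that fails, a cyclic permutation matrix on four nodes is irreducible and bistochastic but has period 4, so `satisfies_theorem_3_1` returns `False` for it.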

It is interesting to note that there exist low-$\mathcal{C}(A)$ matrices that meet the
requirements of Theorem 3.1. This is for instance the case of matrices $A_1$ and
$A_2$ in (2.2) and (2.3), which are irreducible, aperiodic and bistochastic, and
satisfy $\mathcal{C}(A) \leq 4$. Also note that the matrix $A_3$ in (2.5), though irreducible,
aperiodic and bistochastic, should be avoided because $\mathcal{C}(A_3) = N + 1$.

We stress that the irreducibility and aperiodicity conditions are inherent
properties of the graph $\mathcal{G}$, not $A$, insofar as these conditions do not depend
upon the actual values of the nonzero entries of $A$. This is different for the
bistochasticity condition, which requires knowledge of the coefficients of $A$.
In fact, as observed by Sinkhorn and Knopp (1967), it is not always possible
to associate such a bistochastic matrix with a given directed graph $\mathcal{G}$. To
be more precise, consider $G = (g_{ij})_{1 \leq i,j \leq N}$, the transpose of the adjacency
matrix of the graph $\mathcal{G}$; that is, $g_{ij} \in \{0,1\}$ and $g_{ij} = 1 \Leftrightarrow (j,i) \in \mathcal{E}$.
Then $G$ is said to have total support if, for every positive element $g_{ij}$, there
exists a permutation $\sigma$ of $\{1, \ldots, N\}$ such that $j = \sigma(i)$ and $\prod_{k=1}^{N} g_{k\sigma(k)} > 0$.
The main theorem of Sinkhorn and Knopp (1967) asserts that there exists a
bistochastic matrix $A$ of the form $A = D_1 G D_2$, where $D_1$ and $D_2$ are $N \times N$
diagonal matrices with positive diagonals, if and only if $G$ has total support.
The algorithm to induce $A$ from $G$ is called the Sinkhorn-Knopp algorithm.
It does this by generating a sequence of matrices whose rows and columns
are normalized alternately. It is known that the convergence of the algorithm
is linear and upper bounds have been given for its rate of convergence (e.g.,
Knight, 2008).
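
The alternating normalization just described can be sketched in a few lines. This is a minimal illustration, assuming a nonnegative input matrix with total support and no zero row or column; the iteration cap, the stopping tolerance and the example matrix are arbitrary choices made here, not prescriptions from Sinkhorn and Knopp (1967).

```python
def sinkhorn_knopp(G, iterations=1000, tol=1e-12):
    """Alternately rescale rows and columns of G until both sum to 1."""
    n = len(G)
    M = [[float(v) for v in row] for row in G]
    for _ in range(iterations):
        for i in range(n):                       # normalize each row
            s = sum(M[i])
            M[i] = [v / s for v in M[i]]
        col = [sum(M[i][j] for i in range(n)) for j in range(n)]
        for i in range(n):                       # normalize each column
            M[i] = [M[i][j] / col[j] for j in range(n)]
        # Columns now sum to 1; stop once the rows do as well.
        if max(abs(sum(row) - 1.0) for row in M) < tol:
            break
    return M

# Example: a path on three nodes with self-loops (this pattern has total support).
G = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
A = sinkhorn_knopp(G)
```

The zero pattern of `G` is preserved, since each step only multiplies rows and columns by positive constants, so the resulting `A` is a bistochastic matrix carried by the same graph.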

Nevertheless, if for some reason we face a situation where it is impossible
to associate a bistochastic matrix with the graph $\mathcal{G}$, Proposition 3.2 below
shows that it is still possible to obtain information about the performance
ratio, provided $A$ is irreducible and aperiodic.

Proposition 3.2. Assume that $A$ is irreducible and aperiodic. Then

\[ \tau_t(A) \to \frac{1}{N \|\mu\|^2} \quad \text{as } t \to \infty, \]

where $\mu$ is the stationary distribution of $A$.


To illustrate this result, take $N = 2$ and consider the graph $\mathcal{G}$ with (symmetric)
adjacency matrix $\mathbf{1}\mathbf{1}^\top$ (i.e., full communication). Various stochastic
matrices may be associated with $\mathcal{G}$, each with a certain statistical performance.
For $\alpha > 1$ a given parameter, we may choose for example

\[ H_\alpha = \frac{1}{\alpha} \begin{pmatrix} 1 & \alpha - 1 \\ 1 & \alpha - 1 \end{pmatrix}. \]

When $\alpha = 2$, we have $\tau_t(H_2) \to 1$ by Theorem 3.1. More generally, using
Proposition 3.2, it is an easy exercise to prove that, as $t \to \infty$,

\[ \tau_t(H_\alpha) \to \frac{\alpha^2}{2 + 2(\alpha - 1)^2}. \]

We see that the statistical performance of the local estimates deteriorates as
$\alpha$ becomes large, for in this case $\tau_t(H_\alpha)$ gets closer and closer to $1/2$. This
toy model exemplifies the role the stochastic matrix is playing as a “tuning
parameter” to improve the performance of the distributed estimate.
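
The limit in this toy example can be reproduced numerically from Proposition 3.2. The sketch below approximates the stationary distribution $\mu$ by repeatedly applying $\mu \mapsto \mu A$ (a simple power iteration, assumed adequate here because $A$ is irreducible and aperiodic) and then evaluates $1/(N\|\mu\|^2)$. Function names and the iteration count are illustrative assumptions.

```python
def stationary_distribution(A, iterations=10_000):
    """Approximate mu with mu = mu A by power iteration from the uniform law."""
    n = len(A)
    mu = [1.0 / n] * n
    for _ in range(iterations):
        mu = [sum(mu[i] * A[i][j] for i in range(n)) for j in range(n)]
    return mu

def limiting_ratio(A):
    """Limit of tau_t(A) given by Proposition 3.2: 1 / (N * ||mu||^2)."""
    mu = stationary_distribution(A)
    return 1.0 / (len(A) * sum(m * m for m in mu))

def H(alpha):
    """The 2 x 2 matrix H_alpha of the example."""
    return [[1 / alpha, (alpha - 1) / alpha],
            [1 / alpha, (alpha - 1) / alpha]]

def closed_form_limit(alpha):
    """alpha^2 / (2 + 2 (alpha - 1)^2)."""
    return alpha ** 2 / (2 + 2 * (alpha - 1) ** 2)
```

For $H_\alpha$ the stationary distribution is $\mu = (1/\alpha, (\alpha - 1)/\alpha)$, so `limiting_ratio(H(alpha))` and `closed_form_limit(alpha)` should agree, with value 1 at $\alpha = 2$ and values approaching $1/2$ as $\alpha$ grows.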


4 Convergence rates 


Theorem 3.1 gives precise conditions ensuring $\tau_t(A) = 1 + o(1)$, but does not
say anything about the rate (i.e., the behavior of the second-order term) at
which this convergence occurs. It turns out that a much more informative
limit may be obtained at the price of the mild additional assumption that
the stochastic matrix $A$ is symmetric (and hence bistochastic).


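Theorem 4.1. Assume that $A$ is irreducible, aperiodic, and symmetric. Let
$1 > \gamma_2 \geq \cdots \geq \gamma_N > -1$ be the eigenvalues of $A$ different from 1. Then

\[ \tau_t(A) = \frac{1}{1 + \frac{1}{t} \sum_{\ell=2}^{N} \frac{1 - \gamma_\ell^{2t}}{1 - \gamma_\ell^2}}. \]

In addition, setting

\[ \mathcal{S}(A) = \sum_{\ell=2}^{N} \frac{1}{1 - \gamma_\ell^2} \quad \text{and} \quad \Gamma(A) = \max_{2 \leq \ell \leq N} |\gamma_\ell|, \]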
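we have, for all $t \geq 1$,

\[ 1 - \frac{\mathcal{S}(A)}{t} \leq \tau_t(A) \leq 1 - \frac{\mathcal{S}(A)}{t} + \Gamma(A)^{2t}\,\frac{\mathcal{S}(A)}{t} + \Big(\frac{\mathcal{S}(A)}{t}\Big)^2. \]

Clearly, we thus have

\[ t(1 - \tau_t(A)) \to \mathcal{S}(A) \quad \text{as } t \to \infty. \]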

The take-home message is that the smaller the coefficient $\mathcal{S}(A)$, the better
the matrix $A$ performs from a statistical point of view. In this respect, we
note that $\mathcal{S}(A) \geq N - 1$ (uniformly over the set of stochastic, irreducible,
aperiodic, and symmetric matrices). Consider the full-communication matrix

\[ A_0 = \frac{1}{N} \mathbf{1}\mathbf{1}^\top, \tag{4.1} \]

which models a saturated communication network in which each agent shares
its information with all others. The associated communication topology,
which has $\mathcal{C}(A_0) = N + 1$, is roughly equivalent to a centralized algorithm
and, as such, is considered inefficient from a computational point of view. On
the other hand, intuitively, the amount of statistical information propagating
through the network is large so $\mathcal{S}(A_0)$ should be small. Indeed, it is easy
to see that in this case, $\gamma_\ell = 0$ for all $\ell \in \{2, \ldots, N\}$ and $\mathcal{S}(A_0) = N - 1$.
Therefore, although complex in terms of communication, $A_0$ is statistically
optimal.

For a comparative study of statistical performance and communication
complexity of matrices, let us consider the sparser graph associated with the
tridiagonal matrix $A_1$ defined in (2.2). With this choice, $\gamma_\ell = \cos\frac{(\ell-1)\pi}{N}$, $2 \leq \ell \leq N$
(Fiedler, 1972), so that

\[ \mathcal{S}(A_1) = \sum_{\ell=1}^{N-1} \frac{1}{1 - \cos^2\frac{\ell\pi}{N}} \sim \frac{N^2}{3} \quad \text{as } N \to \infty. \tag{4.2} \]

Thus, we lose a power of $N$ but now have lower communication complexity
$\mathcal{C}(A_1) = 3$.

Let us now consider the tridiagonal matrix $A_2$ defined in (2.3). Noticing that
$3A_2 = 2A_1 + I_N$, we deduce that for the matrix $A_2$, $\gamma_\ell = \frac{1}{3} + \frac{2}{3}\cos\frac{(\ell-1)\pi}{N}$,
$2 \leq \ell \leq N$. Thus, as $N \to \infty$,

\[ \mathcal{S}(A_2) = \frac{N^2}{4} + O(N). \tag{4.3} \]

By comparing (4.2) and (4.3), we can conclude that the matrices $A_1$ and
$A_2$, which are both low-$\mathcal{C}(A)$, are also nearly equivalent from a statistical
efficiency point of view. $A_2$ is nevertheless preferable to $A_1$, which has a
larger constant in front of the $N^2$. This slight difference may be due to
the fact that most of the diagonal elements of $A_1$ are zero, so that agents
$i \in \{2, \ldots, N-1\}$ do not integrate their current value in the next iteration,
as happens for $A_2$. Furthermore, for large $N$, the performance of $A_1$ and
$A_2$ are expected to dramatically deteriorate in comparison with those of $A_0$,
since $\mathcal{S}(A_1)$ and $\mathcal{S}(A_2)$ are proportional to $N^2$, while $\mathcal{S}(A_0)$ is proportional
to $N$.
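
These orders of magnitude can be checked directly from the closed-form eigenvalues. The sketch below evaluates the coefficient $\sum_{\ell=2}^{N} 1/(1 - \gamma_\ell^2)$ and the eigenvalue expression of Theorem 4.1 for the two tridiagonal matrices, taking $\gamma_\ell = \cos((\ell-1)\pi/N)$ for $A_1$ and $\gamma_\ell = \frac{1}{3} + \frac{2}{3}\cos((\ell-1)\pi/N)$ for $A_2$. The function names are hypothetical; the exact value $(N^2-1)/3$ mentioned afterwards follows from the classical identity $\sum_{\ell=1}^{N-1} 1/\sin^2(\ell\pi/N) = (N^2-1)/3$.

```python
from math import cos, pi

def eigs_A1(n):
    """Nontrivial eigenvalues of A_1: cos((l-1) pi / n), l = 2..n."""
    return [cos((l - 1) * pi / n) for l in range(2, n + 1)]

def eigs_A2(n):
    """Nontrivial eigenvalues of A_2, using 3 A_2 = 2 A_1 + I."""
    return [1 / 3 + 2 / 3 * g for g in eigs_A1(n)]

def s_coefficient(eigs):
    """S(A): sum of 1 / (1 - gamma^2) over the nontrivial eigenvalues."""
    return sum(1.0 / (1.0 - g * g) for g in eigs)

def ratio_from_eigs(eigs, t):
    """tau_t(A) from the eigenvalue formula of Theorem 4.1."""
    correction = sum((1 - g ** (2 * t)) / (1 - g * g) for g in eigs) / t
    return 1.0 / (1.0 + correction)
```

For example, `s_coefficient(eigs_A1(50))` equals $(50^2 - 1)/3 = 833$ up to rounding, while `s_coefficient(eigs_A2(200)) / 200**2` is close to $1/4$, consistent with (4.2) and (4.3).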


Figure 2 shows the evolution of $\tau_t(A)$ for $N$ fixed and $t$ increasing for the
matrices $A = A_0$, $A_1$, $A_2$ as well as the identity $I_N$.

Figure 2: Evolution of $\tau_t(A)$ with $t$ for different values of $N$, for $A = A_0$,
$A_1$, $A_2$ and $I_N$.

As expected, we see convergence of $\tau_t(A_i)$ to 1, with degraded performance
as the number of agents $N$ increases. Also, we see that the lack of message-passing
for $I_N$ means it is statistically inefficient, with constant $\tau_t(I_N) = 1/N$
for all $t$.

The discussion and plots above highlight the crucial influence of $\mathcal{S}(A)$ on
the performance of the communication network. Indeed, Theorem 4.1 shows
that the optimal order for $\mathcal{S}(A)$ is $N$, and that this scaling is achieved by
the computationally-inefficient choice $A_0$ (see (4.1)). Thus, a natural question
to ask is whether there exist communication networks that have $\mathcal{S}(A)$
proportional to $N$ and, simultaneously, $\mathcal{C}(A)$ constant or small with respect
to $N$. These two conditions, which are in a sense contradictory, impose that
the absolute values of the non-trivial eigenvalues $\gamma_\ell$ stay far from 1, while
the maximal indegree of the graph $\mathcal{G}$ remains moderate. It turns out that
these requirements are satisfied by so-called Ramanujan graphs, which are
presented in the next section.


5 Ramanujan graphs 

In this section, we consider undirected graphs $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ that are also
$d$-regular, in the sense that all vertices have the same degree $d$; that is, each
vertex is incident to exactly $d$ edges. Recall that in this definition, self-loops
are counted twice and multiple edges are allowed. However, in what follows,
we restrict ourselves to graphs without self-loops and multiple edges. In
this setting, the natural (bistochastic) communication matrix $A$ associated
with $\mathcal{G}$ is $A = \frac{1}{d} G$, where $G = (g_{ij})_{1 \leq i,j \leq N}$ is the adjacency matrix of $\mathcal{G}$
($g_{ij} \in \{0,1\}$ and $g_{ij} = 1 \Leftrightarrow (i,j) \in \mathcal{E}$). Note that $\mathcal{C}(A) = d$.

The matrix $G$ is symmetric and we let $d = \mu_1 \geq \mu_2 \geq \cdots \geq \mu_N \geq -d$
be its (real) eigenvalues. Similarly, we let $1 = \gamma_1 \geq \gamma_2 \geq \cdots \geq \gamma_N \geq -1$
be the eigenvalues of $A$, with the straightforward correspondence $\gamma_i = \mu_i/d$.
We note that $A$ is irreducible (or, equivalently, that $\mathcal{G}$ is connected) if and
only if $d > \mu_2$ (see, e.g., Shlomo et al., 2006, Section 2.3). In addition, $A$
is aperiodic as soon as $\mu_N > -d$. According to the Alon-Boppana theorem
(Nilli, 1991) one has, for every $d$-regular graph,

\[ \mu_2 \geq 2\sqrt{d-1} - o_N(1), \]

where the $o_N(1)$ term is a quantity that tends to zero for every fixed $d$ as
$N \to \infty$. Moreover, a $d$-regular graph $\mathcal{G}$ is called Ramanujan if

\[ \max\big(|\mu_\ell| : \mu_\ell < d\big) \leq 2\sqrt{d-1}. \]

In view of the above, a Ramanujan graph is optimal, at least as far as the
spectral gap measure of expansion is concerned. Ramanujan graphs fall in
the category of so-called expander graphs, which have the apparently
contradictory features of being both highly connected and at the same time sparse
(for a review, see Shlomo et al., 2006).

Although the existence of Ramanujan graphs for any degree larger than or
equal to 3 has been recently established by Marcus et al. (2015), their explicit
construction remains difficult to use in practice. However, a conjecture by
Alon (1986), proved by Friedman (2008) (see also Bordenave, 2015) asserts
that most $d$-regular graphs are Ramanujan, in the sense that for every $\varepsilon > 0$,

\[ \mathbb{P}\Big(\max\big(|\mu_2|, |\mu_N|\big) \geq 2\sqrt{d-1} + \varepsilon\Big) \to 0 \quad \text{as } N \to \infty, \]

or equivalently, in terms of the eigenvalues of $A$,

\[ \mathbb{P}\Big(\max\big(|\gamma_2|, |\gamma_N|\big) \geq \frac{2\sqrt{d-1} + \varepsilon}{d}\Big) \to 0 \quad \text{as } N \to \infty. \]

In both results, the limit is along any sequence going to infinity with $Nd$ even,
and the probability is with respect to random graphs uniformly sampled in
the family of $d$-regular graphs with vertex set $\mathcal{V} = \{1, \ldots, N\}$.

In order to generate a random irreducible, aperiodic $d$-regular Ramanujan
graph, we can first generate a random $d$-regular graph using an improved
version of the standard pairing algorithm, proposed by Steger and Wormald
(1999). We retain it if it passes the tests of being irreducible, aperiodic
and Ramanujan as described above. Otherwise, we continue to generate a
$d$-regular graph until all these conditions are satisfied. Figure 3 gives an
example of a 3-regular Ramanujan graph with $N = 16$ vertices, generated in
this way.
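
The generate-and-test loop described above can be sketched as follows. This is an illustration under explicit assumptions, not the procedure used for the figures: the plain pairing model with rejection stands in for the improved Steger and Wormald (1999) algorithm, connectivity and non-bipartiteness (equivalent to irreducibility and aperiodicity for a connected regular graph) are tested by breadth-first 2-coloring, and eigenvalues are computed with cyclic Jacobi rotations, which is adequate only for small $N$. All names, tolerances and the retry cap are hypothetical.

```python
import math
import random
from collections import deque

def random_regular_graph(n, d, rng):
    """Pairing model with rejection; returns a 0/1 adjacency matrix or None."""
    stubs = [v for v in range(n) for _ in range(d)]
    rng.shuffle(stubs)
    G = [[0] * n for _ in range(n)]
    for a, b in zip(stubs[::2], stubs[1::2]):
        if a == b or G[a][b]:              # self-loop or multiple edge: reject
            return None
        G[a][b] = G[b][a] = 1
    return G

def connected_and_nonbipartite(G):
    """Breadth-first 2-coloring: all vertices reached and an odd cycle found."""
    n = len(G)
    color, queue, odd_cycle = {0: 0}, deque([0]), False
    while queue:
        u = queue.popleft()
        for v in range(n):
            if G[u][v]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    odd_cycle = True
    return len(color) == n and odd_cycle

def symmetric_eigenvalues(S, sweeps=100, tol=1e-18):
    """Cyclic Jacobi rotations for a small symmetric matrix; sorted eigenvalues."""
    n = len(S)
    M = [[float(v) for v in row] for row in S]
    for _ in range(sweeps):
        off = sum(M[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if M[p][q] == 0.0:
                    continue
                diff = M[q][q] - M[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * M[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # rotate columns p and q
                    kp, kq = M[k][p], M[k][q]
                    M[k][p], M[k][q] = c * kp - s * kq, s * kp + c * kq
                for k in range(n):          # rotate rows p and q
                    pk, qk = M[p][k], M[q][k]
                    M[p][k], M[q][k] = c * pk - s * qk, s * pk + c * qk
    return sorted(M[i][i] for i in range(n))

def is_ramanujan(G, d):
    """max(|mu| : mu < d) <= 2 sqrt(d - 1), up to numerical tolerance."""
    nontrivial = [abs(m) for m in symmetric_eigenvalues(G) if m < d - 1e-6]
    return max(nontrivial) <= 2 * math.sqrt(d - 1) + 1e-9

def random_ramanujan_graph(n, d, seed=0, max_tries=100_000):
    """Resample until the graph is connected, non-bipartite and Ramanujan."""
    if (n * d) % 2:
        raise ValueError("n * d must be even")
    rng = random.Random(seed)
    for _ in range(max_tries):
        G = random_regular_graph(n, d, rng)
        if G is not None and connected_and_nonbipartite(G) and is_ramanujan(G, d):
            return G
    raise RuntimeError("no suitable graph found within max_tries")
```

A call such as `random_ramanujan_graph(16, 3)` returns the adjacency matrix $G$ of a graph of the kind shown in Figure 3; the communication matrix is then obtained by dividing every entry by $d$.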



Figure 3: Randomly-generated 3-regular Ramanujan graph with $N = 16$
vertices.

Now, given an irreducible and aperiodic communication matrix A associated 
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with a (i-regular Ramanujan graph we have, whenever d > 3, 

d2 

Thus, recalling that $\mathscr{S}(A) \geq N - 1$, we see that $\mathscr{S}(A)$ scales optimally as $N$
while having $\mathscr{C}(A) = d$ (fixed). This remarkable superefficiency property can
be compared with the full-communication matrix $A_0$, which has $\mathscr{S}(A_0) =
N - 1$ but inadmissible complexity $\mathscr{C}(A_0) = N + 1$.
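For the reader's convenience, here is one way to see why a bound of order $N$ holds. The derivation assumes that $\mathscr{S}(A) = \sum_{\ell=2}^{N} 1/(1-\gamma_\ell^2)$ (the form used in the proof of Theorem 4.1) and that the eigenvalues of $A$ are those of the adjacency matrix divided by $d$; it is a sketch under these assumptions rather than a restatement of the authors' display.

```latex
% Assumptions: \gamma_\ell = \mu_\ell / d and
% \mathscr{S}(A) = \sum_{\ell=2}^{N} 1 / (1 - \gamma_\ell^2).
\[
|\gamma_\ell| \le \frac{2\sqrt{d-1}}{d}
\;\Longrightarrow\;
\frac{1}{1-\gamma_\ell^{2}}
\le \frac{1}{1 - 4(d-1)/d^{2}}
= \frac{d^{2}}{(d-2)^{2}},
\qquad 2 \le \ell \le N,
\]
\[
N - 1 \;\le\; \mathscr{S}(A) \;\le\; \frac{d^{2}}{(d-2)^{2}}\,(N-1).
\]
```

For instance, the multiplicative constant is 9 for $d = 3$ and $25/9$ for $d = 5$.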

The statistical efficiency of these graphs is further highlighted in Figure 4.
It shows results for 3- and 5-regular Ramanujan-type matrices ($A_3$ and $A_5$)
as well as the previous results for non-Ramanujan-type matrices $A_0$, $A_1$ and
$A_2$ (see Figure 2).




Figure 4: Evolution of $\tau_t(A)$ with $t$ for different values of $N$, for $A = A_0$, $A_1$,
$A_2$ as before with the addition of 3- and 5-regular Ramanujan-type matrices
$A_3$ and $A_5$.


We see that $A_3$ is already close to the statistical performance of $A_0$, the
saturated network, and for all intents and purposes $A_5$ is essentially as good
as $A_0$, even when there are $N = 1000$ nodes; i.e., the statistical performance
of the 5-regular Ramanujan graph is barely distinguishable from that of the
totally connected graph! Nevertheless, we must not forget that the possibility
of building such efficient networks in real-world situations will ultimately
depend on the specific application, and may not always be possible.

Next, assuming that the Ramanujan-type matrix $A$ is irreducible and
aperiodic, it is apparent that there is a compromise to be made between the
communication complexity of the algorithm (as measured by the degree
index $\mathscr{C}(A) = d$) and its statistical performance (as measured by the coefficient
$\mathscr{S}(A)$). Clearly, the two are in conflict. This raises a question: is it
possible to reach a compromise in the range of statistical performances $\mathscr{S}(A)$
while varying the communication complexity between $d = 3$ and $d = N$? The
answer is affirmative, as shown in the following simulation exercise.

We fix $N = 200$ and then for each $d = 3, \dots, N$:

(i) Generate a matrix $A_d$ associated with a $d$-regular Ramanujan graph as
before.

(ii) Compute the (non-unitary) eigenvalues $\gamma_2^{(d)}, \dots, \gamma_N^{(d)}$ of the matrix $A_d$
and evaluate the sum

$$\mathscr{S}(A_d) = \sum_{\ell=2}^{N} \frac{1}{1 - \big(\gamma_\ell^{(d)}\big)^2}.$$

(iii) Plot $\mathscr{S}(A_d)$ and $\beta\,\mathscr{C}(A_d) = \beta d$ as well as penalized sums $\mathscr{S}(A_d) +
\beta\,\mathscr{C}(A_d)$ for $\beta \in \{1/2, 1, 2, 4\}$, where $\beta$ represents an explicit cost
incurred when increasing the number of connections between nodes.
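Steps (ii) and (iii) can be sketched numerically as below (plotting is omitted). The sketch assumes a symmetric stochastic matrix, so that the eigenvalues are real, and it assumes the sum has the form $\sum_{\ell \geq 2} 1/(1-\gamma_\ell^2)$. The function names are hypothetical, and `circulant_matrix` is only a simple stand-in family indexed by $d$, not a Ramanujan construction.

```python
import numpy as np

def spectral_sum(A):
    """S(A): sum of 1 / (1 - gamma^2) over the non-unit eigenvalues of symmetric A."""
    gamma = np.sort(np.linalg.eigvalsh(A))[::-1]   # gamma[0] is the eigenvalue 1
    return float(np.sum(1.0 / (1.0 - gamma[1:] ** 2)))

def best_degree(sums_by_degree, beta):
    """Return the degree d minimizing the penalized sum S(A_d) + beta * d."""
    return min(sums_by_degree, key=lambda d: sums_by_degree[d] + beta * d)

def circulant_matrix(n, d):
    """Hypothetical stand-in for A_d: ring offsets +-1..+-(d//2), self-loop if d is odd."""
    A = np.zeros((n, n))
    for i in range(n):
        for s in range(1, d // 2 + 1):
            A[i, (i + s) % n] += 1.0
            A[i, (i - s) % n] += 1.0
        if d % 2 == 1:
            A[i, i] += 1.0
    return A / d

n = 21                                             # odd n keeps the ring aperiodic
sums = {d: spectral_sum(circulant_matrix(n, d)) for d in range(3, 10)}
for beta in (0.5, 1, 2, 4):
    print(beta, best_degree(sums, beta))
```

Replacing `circulant_matrix` by a generator of Ramanujan-type matrices would reproduce the exercise described in the text.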



Results are shown in Figure 5, where $d^\star$ refers to the $d$ for which the penalized
sum $\mathscr{S}(A_d) + \beta\,\mathscr{C}(A_d)$ is minimized. We observe that $\mathscr{S}(A_d)$ is decreasing
whereas $\mathscr{C}(A_d)$ increases linearly. The tradeoff between statistical efficiency
and communication complexity can be seen as minimizing their penalized
sum, where $\beta$ for example represents a monetary cost incurred by adding
new network connections between nodes. We see that the optimal $d^\star$ and
thus the number of node connections decreases as the cost of adding new
ones increases.

Next, let us investigate the tradeoffs involved in the case where we have a
large but fixed total number $T$ of data to be streamed to $N$ nodes, each
receiving one new data value from time $t = 1$ to time $t = T/N$. In this
context, the natural question to ask is how many nodes should we choose,
and how much communication should we allow between them in order to
get “very good” results for a “low” cost? Here a low cost comes from both
limiting the number of nodes as well as the number of connections between
them.

Figure 5: Statistical efficiency vs communication complexity tradeoff for
four different node communication penalties $\beta$. $d^\star$ is the $d$ which minimizes
the penalized sum $\mathscr{S}(A_d) + \beta\,\mathscr{C}(A_d)$.

In the same set-up for $A_d$ defined above, one way to look at this is to ask,
for each $N$, what is the smallest $d \in \{3, \dots, N\}$ and therefore the smallest
communication cost $\mathscr{C}(A_d) = d$ for which the performance ratio $\tau_t(A_d)$ is at
least 0.99 after receiving all the data, i.e., when $t = T/N$? Then, as there
is also a cost associated with increasing $N$, minimizing $\mathscr{C}(A_{d^\star})/N$ (where
$d^\star$ is this smallest $d$ chosen) should help us choose the number of nodes $N$
and the amount of connection between them. The result of this is
shown in Figure 6 for $T = 100$ million data points. The minimum is found
at $(N, d^\star) = (710, 3)$, suggesting that with 100 million data points, one can
get excellent performance results ($\tau_t(A_{d^\star}) \geq 0.99$) for a low cost with around
700 nodes, each connected only to three other nodes! Increasing $N$ further
raises the cost necessary to obtain the same performance, both due to the
price of adding more nodes, as well as requiring more connections between
them: $d^\star$ must increase to 4, 5, and so on.
Figure 6: Minimizing the number of nodes $N$ and the level of communication
$d$ required between nodes to obtain a performance ratio $\tau_t(A_d) \geq 0.99$ given
a large fixed quantity of data $T$.

6 Asynchronous models 

The models considered so far assume that messages from one agent to
another are immediately delivered. However, a distributed environment may
be subject to communication delays, for instance when some processors
compute faster than others or when latency and finite bandwidth issues perturb
message transmission. In the presence of such communication delays, it is
conceivable that an agent will end up averaging its own value with an
outdated value from another processor. Situations of this type fall within the
framework of distributed asynchronous computation (Tsitsiklis et al., 1986;
Bertsekas and Tsitsiklis, 1997). In the present section, we have in mind a
model where agents do not have to wait at predetermined moments for
predetermined messages to become available. We thus allow some agents to
compute faster and execute more iterations than others and allow
communication delays to be substantial.

Communication delays are incorporated into our model as follows. For $B$ a
nonnegative integer, we assume that the last instant before $t$ where agent $j$
sent a message to agent $i$ is $t - B_{ij}$, where $B_{ij} \in \{0, \dots, B\}$. Put differently,
recalling that $\hat\theta_t^{(i)}$ is the estimate held by agent $i$ at time $t$, we have

$$\hat\theta_{t+1}^{(i)} = \frac{1}{t+1} \sum_{j=1}^{N} a_{ij}\,(t - B_{ij})\,\hat\theta_{t-B_{ij}}^{(j)} + \frac{1}{t+1}\,X_{t+1}^{(i)}, \quad t \geq 1. \tag{6.1}$$


Thus, at time $t$, when agent $i$ uses the value of another agent $j$, this value is
not necessarily the most recent one but rather an outdated one $\hat\theta_{t-B_{ij}}^{(j)}$,
where $B_{ij}$ represents the communication delay. The time instants $t - B_{ij}$ are
deterministic and, in any case, $0 \leq B_{ij} \leq B$, i.e., we assume that delays are
bounded. Notice that some of the values $t - B_{ij}$ in (6.1) may be negative;
in this case, by convention we set $\hat\theta_{t-B_{ij}}^{(j)} = 0$. Our goal is to establish a
counterpart to Theorem 3.1 in the presence of communication delays. As
usual, we set $\hat{\boldsymbol{\theta}}_t = (\hat\theta_t^{(1)}, \dots, \hat\theta_t^{(N)})^{\top}$.
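To make the delayed recursion concrete, here is a minimal simulation sketch. It works with the scaled quantity $t\,\hat\theta_t^{(i)}$, for which (6.1) reads as a plain delayed sum. Several choices are assumptions made for illustration only: the names `simulate_delayed` and `path_matrix`, the initialization $\hat\theta_1^{(i)} = X_1^{(i)}$, and the nearest-neighbour matrix with weights $1/2$, which is meant in the spirit of $A_1$ without reproducing (2.2).

```python
import numpy as np

def path_matrix(N):
    """Hypothetical A_1-like matrix: nearest-neighbour averaging with weights 1/2."""
    A = np.zeros((N, N))
    for i in range(N):
        A[i, max(i - 1, 0)] += 0.5
        A[i, min(i + 1, N - 1)] += 0.5
    return A

def simulate_delayed(A, B, X):
    """Run the delayed recursion.

    A: (N, N) stochastic matrix; B: (N, N) integer delays with B[i, j] >= 0;
    X: (T, N) array, X[t - 1, i] being the datum received by agent i at time t.
    Returns theta of shape (T + 1, N); theta[0] = 0 by convention.
    """
    T, N = X.shape
    Y = np.zeros((T + 1, N))              # Y[t] = t * theta[t]; zero for t <= 0
    for t in range(1, T + 1):
        for i in range(N):
            acc = X[t - 1, i]
            for j in range(N):
                s = t - 1 - B[i, j]       # time stamp of the value received from j
                if A[i, j] > 0 and s > 0:
                    acc += A[i, j] * Y[s, j]
            Y[t, i] = acc
    theta = np.zeros_like(Y)
    theta[1:] = Y[1:] / np.arange(1, T + 1)[:, None]
    return theta

rng = np.random.default_rng(1)
N, T = 5, 3000
X = rng.uniform(0.0, 1.0, size=(T, N))    # bounded data with mean 0.5
A = path_matrix(N)
no_delay = simulate_delayed(A, np.zeros((N, N), dtype=int), X)
unit_delay = simulate_delayed(A, np.ones((N, N), dtype=int), X)
print(no_delay[T], unit_delay[T])
```

With all delays equal to one, the raw estimates settle near half the true mean in this sketch, which illustrates why a renormalization of the estimate is needed once delays are present.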

Let $K(t)$ be the smallest $\ell$ such that for all $(k_0, \dots, k_\ell) \in \{1, \dots, N\}^{\ell+1}$
satisfying $\prod_{j=1}^{\ell} a_{k_{j-1}k_j} > 0$, we have

$$t - \ell - \sum_{i=1}^{\ell} B_{k_{i-1}k_i} \leq B.$$

Observe that $t - \ell - \sum_{i=1}^{\ell} B_{k_{i-1}k_i}$ is the last time before $t$ when a message was
sent from agent $k_0$ to agent $k_\ell$ via $k_1, \dots, k_{\ell-1}$. Accordingly, $K(t)$ is nothing
but the smallest number of transitions needed to return at a time instant
earlier than $B$, whatever the path. We note that $K(t)$ is roughly of order $t$,
since

$$\frac{1}{B+1} \leq \liminf_{t\to\infty} \frac{K(t)}{t} \leq \limsup_{t\to\infty} \frac{K(t)}{t} \leq 1.$$
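The quantity $K(t)$ can be computed by a shortest-path style recursion, since the binding path is the one with the smallest accumulated delay. The sketch below is an illustration under assumptions: the function name `smallest_return_steps` is hypothetical, the inequality defining $K(t)$ is read as non-strict, and the matrix is a nearest-neighbour example with weights $1/2$ rather than the exact matrix of (2.2).

```python
import numpy as np

def smallest_return_steps(A, B, t, B_max):
    """Smallest l such that t - l - (total delay) <= B_max on every positive-weight path."""
    N = A.shape[0]
    cost = np.where(A > 0, B.astype(float), np.inf)   # forbid zero-weight transitions
    best = np.zeros(N)        # min accumulated delay over paths of length l ending at k
    ell = 0
    while t - ell - best.min() > B_max:
        best = np.min(best[:, None] + cost, axis=0)   # one min-plus step
        ell += 1
    return ell

N = 5
A = np.zeros((N, N))
for i in range(N):                                    # hypothetical A_1-like matrix
    A[i, max(i - 1, 0)] += 0.5
    A[i, min(i + 1, N - 1)] += 0.5

print(smallest_return_steps(A, np.zeros((N, N), dtype=int), 100, 0))
print(smallest_return_steps(A, np.ones((N, N), dtype=int), 100, 1))
```

With no delay the function returns $t$ itself, and with all delays equal to one it returns about $t/2$, in line with the two-sided bound on $K(t)/t$.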


From now on, it is assumed that $A = A_1$, i.e., the irreducible, aperiodic,
and symmetric matrix defined in (2.2). Besides its simplicity, this choice is
motivated by the fact that $A_1$ is communication-efficient while its associated
performance obeys


n(A) 


1 - 


6 t 


for large t and N. The main result of the section now follows. 


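Theorem 6.1. Assume that $X$ is bounded and let $A = A_1$ be defined as in
(2.2). Then, as $t \to \infty$,

$$\mathbb{E}\,\Big\| \frac{t}{K(t)}\,\hat{\boldsymbol{\theta}}_t - \theta\mathbf{1} \Big\|^2 = \mathrm{O}\Big(\frac{1}{t}\Big).$$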
The advantages one hopes to gain from asynchronism are twofold. First, a
reduction of the synchronization penalty and a potential speed advantage
over synchronous algorithms, perhaps at the expense of higher
communication complexity. Second, a greater implementation flexibility and tolerance
to system failure and uncertainty. On the other hand, the powerful result
of Theorem 6.1 comes at the price of assumptions on the transmission
network, which essentially demand that communication delays $B_{ij}$ are
time-independent. In fact, we find that the introduction of delays considerably
complicates the consistency analysis of $\tau_t(A)$ even for the simple case of the
empirical mean. This unexpected mathematical burden is due to the fact
that the introduction of delays makes the analysis of the variance of the
estimates quite complicated.


7 Proofs 


We start this section by recalling the following important theorem, whose 
proof can be found for example in Foata and Fuchs (2004, Theorems 6.8.3 
and 6.8.4). Here and elsewhere, A stands for the stochastic communication 
matrix. 

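Theorem 7.1. Let $\lambda_1, \dots, \lambda_d$ be the eigenvalues of $A$ of unit modulus (with
$\lambda_1 = 1$) and $\Gamma$ be the set of eigenvalues of $A$ of modulus strictly smaller than
1.

(i) There exist projectors $Q_1, \dots, Q_d$ such that, for all $k \geq N$,

$$A^k = \sum_{\ell=1}^{d} \lambda_\ell^k Q_\ell + \sum_{\gamma\in\Gamma} \gamma^k Q_\gamma(k),$$

where the matrices $\{Q_\gamma(k) : k \geq N, \gamma \in \Gamma\}$ satisfy $Q_\gamma(k)Q_{\gamma'}(k') =
Q_\gamma(k + k')$ if $\gamma = \gamma'$, and 0 otherwise. In addition, for all $\gamma \in \Gamma$,
$\lim_{k\to\infty} \gamma^k Q_\gamma(k) = 0$.

(ii) The sequence $(A^k)_{k\geq 0}$ converges in the Cesàro sense to $Q_1$, i.e.,

$$\frac{1}{t}\sum_{k=0}^{t} A^k \to Q_1 \quad \text{as } t \to \infty.$$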
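Statement (ii) can be checked numerically on a small periodic example, where the powers of the matrix do not converge but their averages do. The sketch below is only an illustration; the function name `cesaro_mean` and the two-state matrix are chosen for this purpose.

```python
import numpy as np

def cesaro_mean(A, t):
    """Return (1/t) * sum_{k=0}^{t} A^k, the average appearing in statement (ii)."""
    total = np.zeros_like(A, dtype=float)
    power = np.eye(A.shape[0])
    for _ in range(t + 1):
        total += power
        power = power @ A
    return total / t

A = np.array([[0.0, 1.0], [1.0, 0.0]])     # irreducible with period 2
print(np.linalg.matrix_power(A, 1000))     # powers keep oscillating
print(cesaro_mean(A, 1000))                # averages approach the uniform projector
```

Here the limit is the matrix with all entries equal to $1/2$, which plays the role of $Q_1$ for this example.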
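7.1 Proof of Proposition 3.1

According to (2.4), since $A^k$ is a stochastic matrix, we have

$$\hat{\boldsymbol{\theta}}_t - \theta\mathbf{1} = \frac{1}{t}\sum_{k=0}^{t-1} A^k\big(\mathbf{X}_{t-k} - \theta\mathbf{1}\big).$$

Therefore, it may be assumed, without loss of generality, that $\theta = 0$. Thus,

$$\tau_t(A) = \frac{\mathbb{E}\|\bar X_{Nt}\mathbf{1}\|^2}{\mathbb{E}\|\hat{\boldsymbol{\theta}}_t\|^2} = \frac{\sigma^2}{t\,\mathbb{E}\|\hat{\boldsymbol{\theta}}_t\|^2}.$$

Next, let $A^k = (a_{ij}^{(k)})_{1\leq i,j\leq N}$. Then, for each $i \in \{1, \dots, N\}$,

$$\hat\theta_t^{(i)} = \frac{1}{t}\sum_{k=0}^{t-1}\sum_{j=1}^{N} a_{ij}^{(k)} X_{t-k}^{(j)}.$$

By independence of the samples,

$$\mathbb{E}\big(\hat\theta_t^{(i)}\big)^2 = \frac{\sigma^2}{t^2}\sum_{k=0}^{t-1}\sum_{j=1}^{N}\big(a_{ij}^{(k)}\big)^2.$$

Consequently,

$$\tau_t(A) = \frac{\sigma^2}{t\sum_{i=1}^{N}\mathbb{E}\big(\hat\theta_t^{(i)}\big)^2} = \frac{t}{\sum_{k=0}^{t-1}\|A^k\|^2}.$$

Since each $A^k$ is a stochastic matrix, $\|A^k\|^2 \leq N$ and, by the Cauchy-Schwarz
inequality, $\|A^k\|^2 \geq 1$. Thus, $\frac{1}{N} \leq \tau_t(A) \leq 1$, the lower bound being achieved
when $A$ is the identity matrix.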

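Let us now assume that $A$ is reducible, and let $C \subsetneq \{1, \dots, N\}$ be a
recurrence class. Arguing as above, we obtain that for all $i \in C$,

$$\mathbb{E}\big(\hat\theta_t^{(i)}\big)^2 = \frac{\sigma^2}{t^2}\sum_{k=0}^{t-1}\sum_{j=1}^{N}\big(a_{ij}^{(k)}\big)^2 = \frac{\sigma^2}{t^2}\sum_{k=0}^{t-1}\sum_{j\in C}\big(a_{ij}^{(k)}\big)^2.$$

Since $C$ is a recurrence class, the restriction of $A$ to entries in $C$ is a stochastic
matrix as well. Thus, setting $N_1 = |C|$, by the Cauchy-Schwarz inequality,

$$\mathbb{E}\big(\hat\theta_t^{(i)}\big)^2 \geq \begin{cases} \dfrac{\sigma^2}{tN_1} & \text{if } i \in C, \\[2mm] \dfrac{\sigma^2}{tN} & \text{otherwise.} \end{cases}$$

To conclude,

$$\tau_t(A) = \frac{\sigma^2}{t\,\mathbb{E}\|\hat{\boldsymbol{\theta}}_t\|^2} \leq \frac{1}{1 + (N - N_1)/N} \leq \frac{N}{N+1},$$

since $N - N_1 \geq 1$.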


7.2 Proof of Lemma 3.1 


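As in the previous proof, we assume that $\theta = 0$. Recall that

$$\hat{\boldsymbol{\theta}}_t = \frac{1}{t}\sum_{k=0}^{t-1} A^k \mathbf{X}_{t-k}, \quad t \geq 1.$$

Thus, for all $t \geq 1$,

$$\begin{aligned}
\mathbb{E}\|\hat{\boldsymbol{\theta}}_t\|^2 &= \frac{1}{t^2}\,\mathbb{E}\Big\|\sum_{k=0}^{t-1} A^k\mathbf{X}_{t-k}\Big\|^2 \\
&= \frac{1}{t^2}\sum_{k=0}^{t-1} \mathbb{E}\|A^k\mathbf{X}_{t-k}\|^2 \quad \text{(by independence of } \mathbf{X}_1, \dots, \mathbf{X}_t\text{)} \\
&= \frac{1}{t^2}\,\mathbb{E}\,\mathbf{X}_1^{\top}\Big(\sum_{k=0}^{t-1}(A^k)^{\top}A^k\Big)\mathbf{X}_1.
\end{aligned}$$

Denote by $\lambda_1 = 1, \dots, \lambda_d$ the eigenvalues of $A$ of modulus 1, and let $\Gamma$ be
the set of eigenvalues $\gamma$ of $A$ of modulus strictly smaller than 1. According
to Theorem 7.1, there exist projectors $Q_1, \dots, Q_d$ and matrices $Q_\gamma(k)$ such
that for all $k \geq N$,

$$A^k = \sum_{\ell=1}^{d} \lambda_\ell^k Q_\ell + \sum_{\gamma\in\Gamma} \gamma^k Q_\gamma(k).$$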
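Therefore,

$$\begin{aligned}
\sum_{k=0}^{t-1}(A^k)^{\top}A^k &= \sum_{k=0}^{t-1}(\bar{A}^k)^{\top}A^k \\
&= \sum_{k=0}^{t-1}\Big(\sum_{\ell=1}^{d}\bar\lambda_\ell^k\bar Q_\ell + \sum_{\gamma\in\Gamma}\bar\gamma^k\bar Q_\gamma(k)\Big)^{\top}\Big(\sum_{\ell=1}^{d}\lambda_\ell^kQ_\ell + \sum_{\gamma\in\Gamma}\gamma^kQ_\gamma(k)\Big) + \mathrm{O}(1) \\
&= \sum_{k=0}^{t-1}\sum_{\ell,j=1}^{d}\bar\lambda_\ell^k\lambda_j^k\,\bar Q_\ell^{\top}Q_j + \mathrm{o}(t).
\end{aligned}$$

Here, we have used Cesàro's lemma combined with the fact that for any
$\gamma \in \Gamma$, $\lim_{k\to\infty} \gamma^k Q_\gamma(k) = 0$ (Theorem 7.1).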

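Since $A$ is irreducible, according to the Perron-Frobenius theorem (e.g.,
Grimmett and Stirzaker, 2001, page 240), we have that $\lambda_\ell = e^{2\pi i(\ell-1)/d}$, $1 \leq \ell \leq d$.
Accordingly,

$$\bar\lambda_\ell\lambda_j = e^{-2\pi i(\ell-j)/d} = 1 \iff j = \ell.$$

Thus,

$$\sum_{k=0}^{t-1}(A^k)^{\top}A^k = t\sum_{\ell=1}^{d}\bar Q_\ell^{\top}Q_\ell + \mathrm{O}(1) + \mathrm{o}(t).$$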


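Letting $Q = \sum_{\ell=1}^{d}\bar Q_\ell^{\top}Q_\ell$, we obtain

$$\begin{aligned}
t\,\mathbb{E}\|\hat{\boldsymbol{\theta}}_t\|^2 &= \mathbb{E}\,\mathbf{X}_1^{\top}Q\mathbf{X}_1 + \mathbb{E}\,\mathbf{X}_1^{\top}\Big(\frac{1}{t}\sum_{k=0}^{t-1}(A^k)^{\top}A^k - Q\Big)\mathbf{X}_1 \\
&= \mathbb{E}\,\mathbf{X}_1^{\top}Q\mathbf{X}_1 + \mathrm{o}(1) \\
&= \sum_{\ell=1}^{d}\mathbb{E}\|Q_\ell\mathbf{X}_1\|^2 + \mathrm{o}(1).
\end{aligned} \tag{7.1}$$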


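Denoting by $Q_{\ell,ij}$ the $(i,j)$-entry of $Q_\ell$, we conclude

$$\begin{aligned}
t\,\mathbb{E}\|\hat{\boldsymbol{\theta}}_t\|^2 &= \sum_{\ell=1}^{d}\sum_{i=1}^{N}\mathbb{E}\Big|\sum_{j=1}^{N}Q_{\ell,ij}X_1^{(j)}\Big|^2 + \mathrm{o}(1) \\
&= \sigma^2\sum_{\ell=1}^{d}\sum_{i,j=1}^{N}|Q_{\ell,ij}|^2 + \mathrm{o}(1) \quad \text{(by independence of } X_1^{(1)}, \dots, X_1^{(N)}\text{)} \\
&= \sigma^2\sum_{\ell=1}^{d}\|Q_\ell\|^2 + \mathrm{o}(1).
\end{aligned}$$

Lastly, recalling that $\mathbb{E}\|\bar X_{Nt}\mathbf{1}\|^2 = \sigma^2/t$, we obtain

$$\tau_t(A) = \frac{1}{\sum_{\ell=1}^{d}\|Q_\ell\|^2} + \mathrm{o}(1).$$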


7.3 Proof of Theorem 3.1 


Sufficiency. Assume that $A$ is irreducible, aperiodic, and bistochastic. The
first two conditions imply that 1 is the unique eigenvalue of $A$ of unit modulus.
Therefore, according to Lemma 3.1, we only need to prove that the projector
$Q_1$ satisfies $\|Q_1\| = 1$.

Since $A$ is bistochastic, its stationary distribution is the uniform distribution
on $\{1, \dots, N\}$. Moreover, since $A$ is irreducible and aperiodic, we have, as
$k \to \infty$,

$$A^k \to \frac{1}{N}\begin{pmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{pmatrix}.$$

By comparing this limit with that of the second statement of Theorem 7.1,
we conclude by Cesàro's lemma that

$$Q_1 = \frac{1}{N}\begin{pmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{pmatrix}.$$

This implies in particular that $\|Q_1\| = 1$.

Necessity. Assume that $\tau_t(A)$ tends to 1 as $t \to \infty$. According to
Proposition 3.1, $A$ is irreducible. Thus, by Lemma 3.1, we have $\sum_{\ell=1}^{d}\|Q_\ell\|^2 = 1$.
Observe, since each $Q_\ell$ is a projector, that $\|Q_\ell\| \geq 1$. Therefore, the
identity $\sum_{\ell=1}^{d}\|Q_\ell\|^2 = 1$ implies $d = 1$ and $\|Q_1\| = 1$. We conclude that $A$ is
aperiodic.

Then, since $A$ is irreducible and aperiodic, we have, as $k \to \infty$,

$$A^k \to \begin{pmatrix} \mu \\ \vdots \\ \mu \end{pmatrix},$$

where $\mu$ is the stationary distribution of $A$, represented as a row vector.
Comparing once again this limit with the second statement of Theorem 7.1,
we see that

$$Q_1 = \begin{pmatrix} \mu \\ \vdots \\ \mu \end{pmatrix}.$$

Thus, $\|Q_1\|^2 = N\|\mu\|^2 = 1$. In particular, letting $\mu = (\mu_1, \dots, \mu_N)$, we have

$$N\sum_{i=1}^{N}\mu_i^2 = \Big(\sum_{i=1}^{N}\mu_i\Big)^2.$$

This is an equality case in the Cauchy-Schwarz inequality, from which we
deduce that $\mu$ is the uniform distribution on $\{1, \dots, N\}$. Since $\mu$ is the
stationary distribution of $A$, this implies that $A$ is bistochastic.


7.4 Proof of Proposition 3.2 


If $A$ is irreducible and aperiodic, then by Lemma 3.1, $\tau_t(A) \to 1/\|Q_1\|^2$ as
$t \to \infty$. But, as $k \to \infty$,

$$A^k \to \begin{pmatrix} \mu \\ \vdots \\ \mu \end{pmatrix},$$

where the stationary distribution $\mu$ of $A$ is represented as a row vector. By
the second statement of Theorem 7.1, we conclude that $\|Q_1\|^2 = N\|\mu\|^2$.
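The limit $1/(N\|\mu\|^2)$ can be checked numerically on a small non-bistochastic example. The sketch below assumes the identity $\tau_t(A) = t / \sum_{k=0}^{t-1}\|A^k\|^2$ with the sum-of-squared-entries norm, which follows from the computation in Section 7.1 when the coordinates of the data are independent with common variance; the function names and the two-state matrix are illustrative choices.

```python
import numpy as np

def performance_ratio(A, t):
    """tau_t(A) computed as t / sum_{k=0}^{t-1} ||A^k||_F^2 (assumed identity)."""
    total, power = 0.0, np.eye(A.shape[0])
    for _ in range(t):
        total += np.sum(power ** 2)
        power = power @ A
    return t / total

def stationary_distribution(A):
    """Left eigenvector of A for the eigenvalue 1, normalized to sum to one."""
    vals, vecs = np.linalg.eig(A.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return mu / mu.sum()

A = np.array([[0.5, 0.5], [0.2, 0.8]])     # irreducible, aperiodic, not bistochastic
mu = stationary_distribution(A)            # equals (2/7, 5/7)
limit = 1.0 / (A.shape[0] * np.sum(mu ** 2))
print(performance_ratio(A, 2000), limit)
```

For a bistochastic matrix the stationary distribution is uniform and the same computation returns a limit of 1, in agreement with Theorem 3.1.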


7.5 Proof of Theorem 4.1 


Without loss of generality, assume that $\theta = 0$. Since $A$ is irreducible and
aperiodic, the matrix $Q$ in the proof of Lemma 3.1 is $Q = \bar Q_1^{\top}Q_1$. Moreover,
since $A$ is also bistochastic, we have already seen that as $k \to \infty$,

$$A^k \to \frac{1}{N}\begin{pmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{pmatrix}. \tag{7.2}$$

However, by the second statement of Theorem 7.1, the above matrix is equal
to $Q_1$. Thus, the projector $Q_1$ is symmetric, which implies $Q = Q_1$.
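Next, we deduce from (7.1) that

$$\begin{aligned}
t\,\mathbb{E}\|\hat{\boldsymbol{\theta}}_t\|^2 &= \mathbb{E}\,\mathbf{X}_1^{\top}Q\mathbf{X}_1 + \mathbb{E}\,\mathbf{X}_1^{\top}\Big(\frac{1}{t}\sum_{k=0}^{t-1}(A^k)^{\top}A^k - Q\Big)\mathbf{X}_1 \\
&= \sigma^2 + \mathbb{E}\,\mathbf{X}_1^{\top}\Big(\frac{1}{t}\sum_{k=0}^{t-1}A^{2k} - Q\Big)\mathbf{X}_1,
\end{aligned} \tag{7.3}$$

by symmetry of $A$ and the fact that $\mathbb{E}\,\mathbf{X}_1^{\top}Q\mathbf{X}_1 = \sigma^2$. The symmetric matrix
$A$ can be put into the form

$$A = UDU^{\top},$$

where $U$ is a unitary matrix with real entries (so, $U^{-1} = U^{\top}$) and $D =
\mathrm{diag}(1, \gamma_2, \dots, \gamma_N)$, with $1 > \gamma_2 \geq \cdots \geq \gamma_N > -1$. Therefore, as $t \to \infty$,

$$\frac{1}{t}\sum_{k=0}^{t-1}A^{2k} = U\Big(\frac{1}{t}\sum_{k=0}^{t-1}D^{2k}\Big)U^{\top} \to U\begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}U^{\top}.$$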


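However, by (7.2) and Cesàro's lemma,

$$\frac{1}{t}\sum_{k=0}^{t-1}A^{2k} \to Q \quad \text{as } t \to \infty.$$

It follows that $Q = UMU^{\top}$, where

$$M = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}.$$

Thus,

$$\begin{aligned}
\frac{1}{t}\sum_{k=0}^{t-1}A^{2k} - Q &= U\Big(\frac{1}{t}\sum_{k=0}^{t-1}D^{2k} - M\Big)U^{\top} \\
&= U\,\mathrm{diag}\Big(0, \frac{1}{t}\,\frac{1-\gamma_2^{2t}}{1-\gamma_2^{2}}, \dots, \frac{1}{t}\,\frac{1-\gamma_N^{2t}}{1-\gamma_N^{2}}\Big)U^{\top}.
\end{aligned}$$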










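Next, set

$$\alpha_\ell = \frac{1}{t}\,\frac{1-\gamma_\ell^{2t}}{1-\gamma_\ell^{2}}, \quad 2 \leq \ell \leq N,$$

and let $U = (u_{ij})_{1\leq i,j\leq N}$. With this notation, the $(i,j)$-entry of the matrix
$\frac{1}{t}\sum_{k=0}^{t-1}A^{2k} - Q$ is

$$\sum_{\ell=2}^{N} u_{i\ell}\,\alpha_\ell\,u_{j\ell}.$$

Hence,

$$\mathbf{X}_1^{\top}\Big(\frac{1}{t}\sum_{k=0}^{t-1}A^{2k} - Q\Big)\mathbf{X}_1 = \sum_{i=1}^{N}\sum_{j=1}^{N} X_1^{(i)}\Big(\sum_{\ell=2}^{N}u_{i\ell}\,\alpha_\ell\,u_{j\ell}\Big)X_1^{(j)}.$$

Thus,

$$\begin{aligned}
\mathbb{E}\,\mathbf{X}_1^{\top}\Big(\frac{1}{t}\sum_{k=0}^{t-1}A^{2k} - Q\Big)\mathbf{X}_1 &= \sigma^2\sum_{j=1}^{N}\sum_{\ell=2}^{N}u_{j\ell}^2\,\alpha_\ell \\
&= \sigma^2\sum_{\ell=2}^{N}\alpha_\ell\sum_{j=1}^{N}u_{j\ell}^2 \\
&= \sigma^2\sum_{\ell=2}^{N}\alpha_\ell \\
&= \frac{\sigma^2}{t}\sum_{\ell=2}^{N}\frac{1-\gamma_\ell^{2t}}{1-\gamma_\ell^{2}}.
\end{aligned}$$

We conclude from (7.3) that

$$\tau_t(A) = \frac{1}{1 + \frac{1}{t}\sum_{\ell=2}^{N}\frac{1-\gamma_\ell^{2t}}{1-\gamma_\ell^{2}}}.$$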




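This shows the first statement of the theorem. Using the inequality $\frac{1}{1+x} \geq
1 - x$, valid for all $x \geq 0$, we have

$$\tau_t(A) \geq 1 - \frac{1}{t}\sum_{\ell=2}^{N}\frac{1-\gamma_\ell^{2t}}{1-\gamma_\ell^{2}} \geq 1 - \frac{\mathscr{S}(A)}{t}.$$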
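The closed form and the lower bound can be cross-checked numerically. The sketch below compares the eigenvalue formula with a direct evaluation of $t / \sum_{k=0}^{t-1}\|A^k\|^2$ (sum-of-squared-entries norm, as in Section 7.1), taking $\mathscr{S}(A) = \sum_{\ell \geq 2} 1/(1-\gamma_\ell^2)$. The function names are hypothetical, and the test matrix is a nearest-neighbour example with weights $1/2$ chosen for illustration.

```python
import numpy as np

def tau_closed_form(A, t):
    """1 / (1 + (1/t) * sum_{l>=2} (1 - g_l^(2t)) / (1 - g_l^2)) for symmetric A."""
    g = np.sort(np.linalg.eigvalsh(A))[::-1][1:]      # drop the eigenvalue 1
    return 1.0 / (1.0 + np.sum((1.0 - g ** (2 * t)) / (1.0 - g ** 2)) / t)

def tau_direct(A, t):
    """t / sum_{k=0}^{t-1} ||A^k||_F^2, evaluated by explicit matrix powers."""
    total, power = 0.0, np.eye(A.shape[0])
    for _ in range(t):
        total += np.sum(power ** 2)
        power = power @ A
    return t / total

def spectral_sum(A):
    """S(A) = sum_{l>=2} 1 / (1 - g_l^2) for symmetric A."""
    g = np.sort(np.linalg.eigvalsh(A))[::-1][1:]
    return float(np.sum(1.0 / (1.0 - g ** 2)))

N, t = 6, 50
A = np.zeros((N, N))
for i in range(N):                                     # illustrative symmetric matrix
    A[i, max(i - 1, 0)] += 0.5
    A[i, min(i + 1, N - 1)] += 0.5

print(tau_closed_form(A, t), tau_direct(A, t), 1.0 - spectral_sum(A) / t)
```

The first two printed values coincide up to rounding, and both lie above the third, which is the lower bound $1 - \mathscr{S}(A)/t$.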
Finally, evoking the inequality $\frac{1}{1+x} \leq 1 - x + x^2$, valid for all $x \geq 0$, we
conclude


Tt{A) < 1 


+ 1 _ ^2 ^ 





7.6 Proof of Theorem 6.1 


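From now on, we fix $k_0 \in \{1, \dots, N\}$ and let $Z_t^{(i)} = t\,\hat\theta_t^{(i)}$ for any $i \in
\{1, \dots, N\}$. Thus, for all $t \geq 1$,

$$Z_t^{(k_0)} = \sum_{k=1}^{N} a_{k_0k}\,Z_{t-B_{k_0k}-1}^{(k)} + X_t^{(k_0)},$$

and

$$Z_t^{(k_0)} = \sum_{k_1,k_2=1}^{N} a_{k_0k_1}a_{k_1k_2}\,Z_{t-B_{k_0k_1}-B_{k_1k_2}-2}^{(k_2)} + \sum_{k_1=1}^{N} a_{k_0k_1}\,X_{t-B_{k_0k_1}-1}^{(k_1)} + X_t^{(k_0)}. \tag{7.4}$$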


Our first task is to iterate this formula. To do so, we need additional notation.
For $\ell$ a positive integer and $k \in \{1, \dots, N\}$, let $\mathscr{K}^{\ell}(k)$ be the set of vectors in
$\{1, \dots, N\}^{\ell+1}$ of the form $K^{\ell}(k) = (k_0, k_1, \dots, k_{\ell-1}, k)$ such that $w(K^{\ell}(k)) > 0$, where

$$w(K^{\ell}(k)) = a_{k_0k_1}a_{k_1k_2}\cdots a_{k_{\ell-1}k}.$$

In particular, by our choice of $A$, we have $w(K^{\ell}(k)) = 2^{-\ell}$ for any $k$. Next,
we set

$$\Delta(K^{\ell}(k)) = \ell + B_{k_0k_1} + B_{k_1k_2} + \cdots + B_{k_{\ell-2}k_{\ell-1}} + B_{k_{\ell-1}k}.$$

When $\ell = 0$, then by convention $K^{0}(k) = (k_0)$, $w(K^{0}(k)) = 1$ if $k = k_0$ and
0 otherwise, and $\Delta(K^{0}(k)) = 0$.

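We are now ready to iterate (7.4). To do so, observe that

$$\begin{aligned}
Z_t^{(k_0)} &= \sum_{k=1}^{N}\sum_{K^{K(t)}(k)} w\big(K^{K(t)}(k)\big)\,Z_{t-\Delta(K^{K(t)}(k))}^{(k)} + \sum_{\ell=0}^{K(t)-1}\sum_{k=1}^{N}\sum_{K^{\ell}(k)} w\big(K^{\ell}(k)\big)\,X_{t-\Delta(K^{\ell}(k))}^{(k)} \\
&= R_t^1 + R_t^2.
\end{aligned} \tag{7.5}$$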
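By the definition of $K(t)$, for all $k \in \{1, \dots, N\}$, $t - \Delta(K^{K(t)}(k)) \leq B$. Since
$X$ is bounded, we deduce that there exists $C > 0$ such that, for all $k \in \{1, \dots, N\}$
and all $K^{K(t)}(k)$,

$$\big|Z_{t-\Delta(K^{K(t)}(k))}^{(k)}\big| \leq C.$$

This implies that $|R_t^1| \leq C$. To see this, note that $A^{K(t)}$ is a stochastic matrix
and that for all $k \in \{1, \dots, N\}$,

$$\sum_{K^{K(t)}(k)} w\big(K^{K(t)}(k)\big) = \big(A^{K(t)}\big)_{k_0k}.$$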


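The analysis of the term $R_t^2$ is more delicate. The difficulty arises from
the fact that this term is not a sum of independent random variables, and
therefore its components must be grouped. Since each $B_{ij}$ is smaller than $B$
and $\Delta(K^{\ell}(k)) = x$ implies $x \geq \ell$, we obtain

$$\begin{aligned}
R_t^2 &= \sum_{\ell=0}^{K(t)-1}\sum_{k=1}^{N}\sum_{x=0}^{(B+1)\ell}\;\sum_{K^{\ell}(k)\,:\,\Delta(K^{\ell}(k))=x} w\big(K^{\ell}(k)\big)\,X_{t-x}^{(k)} \\
&= \sum_{x=0}^{(B+1)(K(t)-1)}\sum_{k=1}^{N} X_{t-x}^{(k)} \sum_{\ell=\lfloor x/(B+1)\rfloor+1}^{x}\;\sum_{K^{\ell}(k)\,:\,\Delta(K^{\ell}(k))=x} w\big(K^{\ell}(k)\big)
\end{aligned}$$

($\lfloor\cdot\rfloor$ is the floor function). By independence of the $X_{t-x}^{(k)}$, we get

$$\mathrm{Var}(R_t^2) = \sigma^2\sum_{x=0}^{(B+1)(K(t)-1)}\sum_{k=1}^{N}\Big(\sum_{\ell=\lfloor x/(B+1)\rfloor+1}^{x}\;\sum_{K^{\ell}(k)\,:\,\Delta(K^{\ell}(k))=x} w\big(K^{\ell}(k)\big)\Big)^2.$$

Recalling that $w(K^{\ell}(k)) = 2^{-\ell}$, we obtain

$$\mathrm{Var}(R_t^2) = \sigma^2\sum_{x=0}^{(B+1)(K(t)-1)}\sum_{k=1}^{N}\Big(\sum_{\ell=\lfloor x/(B+1)\rfloor+1}^{x}\frac{1}{2^{\ell}}\,\big|\{K^{\ell}(k) : \Delta(K^{\ell}(k)) = x\}\big|\Big)^2.$$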


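Next, consider the Markov chain $(Y_n)_{n\geq 0}$ with transition matrix $A$ such that
$Y_0 = k_0$. Observe that

$$\mathbb{P}\Big(Y_\ell = k,\; \ell + \sum_{i=1}^{\ell} B_{Y_{i-1}Y_i} = x\Big) = \frac{1}{2^{\ell}}\,\big|\{K^{\ell}(k) : \Delta(K^{\ell}(k)) = x\}\big|.$$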
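Moreover, for fixed $x$, the events

$$\Big\{Y_\ell = k,\; \ell + \sum_{i=1}^{\ell} B_{Y_{i-1}Y_i} = x\Big\}, \quad \ell = \lfloor x/(B+1)\rfloor + 1, \dots, x,$$

are disjoint since the $B_{ij}$ are nonnegative. Thus,

$$\sum_{\ell=\lfloor x/(B+1)\rfloor+1}^{x}\frac{1}{2^{\ell}}\,\big|\{K^{\ell}(k) : \Delta(K^{\ell}(k)) = x\}\big| \leq 1,$$

and so,

$$\mathrm{Var}(R_t^2) \leq \sigma^2\sum_{x=0}^{(B+1)(K(t)-1)}\sum_{k=1}^{N} 1 = \sigma^2N\big((B+1)K(t) - B\big). \tag{7.6}$$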


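The expectation of $R_t^2$ is easier to compute. Indeed, since each $A^{\ell}$ is a
stochastic matrix,

$$\mathbb{E}R_t^2 = \theta\sum_{\ell=0}^{K(t)-1}\sum_{k=1}^{N}\sum_{K^{\ell}(k)} w\big(K^{\ell}(k)\big) = \theta\sum_{\ell=0}^{K(t)-1}\sum_{k=1}^{N}\big(A^{\ell}\big)_{k_0k} = \theta K(t).$$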


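Combining (7.5), (7.6), and the fact that $|R_t^1| \leq C$, we obtain

$$\mathbb{E}\Big(\frac{t}{K(t)}\,\hat\theta_t^{(k_0)} - \theta\Big)^2 = \mathrm{O}\Big(\frac{1}{K(t)}\Big).$$

The result follows from the identity $1/K(t) = \mathrm{O}(1/t)$.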